In [ ]:
from mlxtend.plotting import plot_decision_regions
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
In [ ]:
#Loading the dataset
diabetes_data = pd.read_csv('diabetes.csv')

#Print the first 5 rows of the dataframe.
diabetes_data.head()
Out[ ]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
In [ ]:
diabetes_data.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

The DataFrame.describe() function generates descriptive statistics that summarize the central tendency, dispersion, and shape of the dataset's distribution, excluding NaN values. This method tells us a lot about the dataset. One important thing to note is that describe() only works on numeric values; it does not handle categorical values. If a column contains categorical values, describe() ignores it and summarizes the other columns, unless the argument include="all" is passed.

Statistics produced by the describe() method:

  • count tells us the number of non-null rows in the feature.
  • mean tells us the average value of the feature.
  • std tells us the standard deviation of the feature.
  • min tells us the minimum value of the feature.
  • 25%, 50% and 75% are the percentiles/quartiles of each feature. The quartile information helps us detect outliers.
  • max tells us the maximum value of the feature.
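As a small illustration of the include="all" behavior described above, here is a sketch using a tiny hypothetical frame (not the diabetes data), with one numeric and one categorical column:

```python
import pandas as pd

# Hypothetical frame: 'Age' is numeric, 'Group' is categorical
demo = pd.DataFrame({'Age': [21, 35, 50], 'Group': ['a', 'b', 'a']})

num_only = demo.describe()               # 'Group' is silently ignored
with_cat = demo.describe(include='all')  # 'Group' summarized too (count/unique/top/freq)
print(num_only.columns.tolist())  # ['Age']
print(with_cat.columns.tolist())  # ['Age', 'Group']
```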
In [ ]:
diabetes_data.describe()
Out[ ]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000 0.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000
In [ ]:
diabetes_data.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
Pregnancies 768.0 3.845052 3.369578 0.000 1.00000 3.0000 6.00000 17.00
Glucose 768.0 120.894531 31.972618 0.000 99.00000 117.0000 140.25000 199.00
BloodPressure 768.0 69.105469 19.355807 0.000 62.00000 72.0000 80.00000 122.00
SkinThickness 768.0 20.536458 15.952218 0.000 0.00000 23.0000 32.00000 99.00
Insulin 768.0 79.799479 115.244002 0.000 0.00000 30.5000 127.25000 846.00
BMI 768.0 31.992578 7.884160 0.000 27.30000 32.0000 36.60000 67.10
DiabetesPedigreeFunction 768.0 0.471876 0.331329 0.078 0.24375 0.3725 0.62625 2.42
Age 768.0 33.240885 11.760232 21.000 24.00000 29.0000 41.00000 81.00
Outcome 768.0 0.348958 0.476951 0.000 0.00000 0.0000 1.00000 1.00

For these columns, a value of zero is not physically meaningful, so zeros indicate missing values.

The following columns or variables contain invalid zero values:

  1. Glucose
  2. BloodPressure
  3. SkinThickness
  4. Insulin
  5. BMI

It is therefore better to replace the zeros with NaN first, because that makes it easier to impute them with suitable values afterwards.

In [ ]:
diabetes_data_copy = diabetes_data.copy(deep = True)
diabetes_data_copy[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']] \
    = diabetes_data_copy[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']].replace(0, np.nan)  # np.NaN was removed in NumPy 2.0; np.nan works everywhere

## showing the count of Nans
print(diabetes_data_copy.isnull().sum())
Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

Plotting the distribution of the data shows why replacing the zeros with NaN is necessary.

In [ ]:
p = diabetes_data.hist(figsize = (20,20))

Impute the NaN values in each column according to that column's distribution.

In [ ]:
diabetes_data_copy['Glucose'] = diabetes_data_copy['Glucose'].fillna(diabetes_data_copy['Glucose'].mean())
diabetes_data_copy['BloodPressure'] = diabetes_data_copy['BloodPressure'].fillna(diabetes_data_copy['BloodPressure'].mean())
diabetes_data_copy['SkinThickness'] = diabetes_data_copy['SkinThickness'].fillna(diabetes_data_copy['SkinThickness'].median())
diabetes_data_copy['Insulin'] = diabetes_data_copy['Insulin'].fillna(diabetes_data_copy['Insulin'].median())
diabetes_data_copy['BMI'] = diabetes_data_copy['BMI'].fillna(diabetes_data_copy['BMI'].median())

Plot the distributions after the NaN values have been imputed.

In [ ]:
p = diabetes_data_copy.hist(figsize = (20,20))

Skewness

A left-skewed distribution has a long left tail. Left-skewed distributions are also called negatively skewed, because there is a long tail in the negative direction on the number line. The mean also lies to the left of the peak.

A right-skewed distribution has a long right tail. Right-skewed distributions are also called positively skewed, because there is a long tail in the positive direction on the number line. The mean also lies to the right of the peak.
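A quick way to quantify this is pandas' skew() method. A sketch with hypothetical synthetic samples: an exponential sample is right-skewed (positive skew), and its negation is left-skewed (negative skew):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
right = pd.Series(rng.exponential(size=1000))  # long right tail
left = -right                                  # mirrored: long left tail

print(right.skew())  # positive -> mean lies right of the peak
print(left.skew())   # negative -> mean lies left of the peak
```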

In [ ]:
## observing the shape of the data
diabetes_data.shape
Out[ ]:
(768, 9)
In [ ]:
## checking the balance of the data by plotting the count of outcomes by their value
color_wheel = {1: "#0392cf",
               2: "#7bc043"}
print(diabetes_data.Outcome.value_counts())
p = diabetes_data.Outcome.value_counts().plot(kind="bar",
        color=[color_wheel[1], color_wheel[2]])  # one color per outcome bar
0    500
1    268
Name: Outcome, dtype: int64

The plot above shows that the data is biased towards data points with an Outcome value of 0, meaning no diabetes. The number of non-diabetic patients is almost twice the number of diabetic patients.

Scatter matrix of the unclean data

In [ ]:
from pandas.plotting import scatter_matrix
p=scatter_matrix(diabetes_data,figsize=(25, 25))
The pair plot builds on two basic figures, the histogram and the scatter plot. The histograms on the diagonal let us see the distribution of each single variable, while the scatter plots in the upper and lower triangles show the relationship (or lack thereof) between two variables.

Pair plot of the clean data

In [ ]:
p=sns.pairplot(diabetes_data_copy, hue = 'Outcome')

*Pearson's Correlation Coefficient*: helps you find out the relationship between two quantities. It gives you a measure of the strength of association between two variables. The value of Pearson's correlation coefficient ranges from -1 to +1: values near +1 or -1 indicate a strong positive or negative correlation, while 0 indicates no correlation.
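As a minimal sanity check on that range, assuming NumPy's np.corrcoef as the Pearson implementation and hypothetical toy data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]  # perfect positive linear relationship
r_neg = np.corrcoef(x, -x)[0, 1]         # perfect negative linear relationship
print(r_pos, r_neg)  # close to 1.0 and -1.0
```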

A heatmap is a two-dimensional representation of information with the help of colors. Heatmaps can help the user visualize simple or complex information.

Heatmap for unclean data

In [ ]:
plt.figure(figsize=(12,10))  # set the figure size to 12 x 10
p = sns.heatmap(diabetes_data.corr(), annot=True, cmap='RdYlGn')  # seaborn makes correlation heatmaps simple

Heatmap for clean data

In [ ]:
plt.figure(figsize=(12,10))  # set the figure size to 12 x 10
p = sns.heatmap(diabetes_data_copy.corr(), annot=True, cmap='RdYlGn')  # seaborn makes correlation heatmaps simple

Scaling the data

The data is rescaled to Z such that μ = 0 and σ = 1, through this formula:

Z = (x − μ) / σ

In [ ]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X = pd.DataFrame(sc_X.fit_transform(diabetes_data_copy.drop(["Outcome"], axis=1)),
                 columns=['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
                          'BMI', 'DiabetesPedigreeFunction', 'Age'])
In [ ]:
X.head()
Out[ ]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
0 0.639947 0.865108 -0.033518 0.670643 -0.181541 0.166619 0.468492 1.425995
1 -0.844885 -1.206162 -0.529859 -0.012301 -0.181541 -0.852200 -0.365061 -0.190672
2 1.233880 2.015813 -0.695306 -0.012301 -0.181541 -1.332500 0.604397 -0.105584
3 -0.844885 -1.074652 -0.529859 -0.695245 -0.540642 -0.633881 -0.920763 -1.041549
4 -1.141852 0.503458 -2.680669 0.670643 0.316566 1.549303 5.484909 -0.020496
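As a sanity check that StandardScaler implements the Z = (x − μ) / σ formula above, a minimal sketch on a hypothetical single feature:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature column
x = np.array([[1.0], [2.0], [3.0], [4.0]])
manual = (x - x.mean()) / x.std()         # np.std defaults to population std, as sklearn uses
scaled = StandardScaler().fit_transform(x)
print(np.allclose(manual, scaled))  # True
```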
In [ ]:
#X = diabetes_data.drop("Outcome",axis = 1)
y = diabetes_data_copy.Outcome

Why scale the data?

When applying distance-based algorithms such as KNN, it is always advisable to bring all features onto the same scale.

Let's look at an example distance calculation using two features with very different magnitudes/ranges.

Euclidean Distance = [(100000 − 80000)^2 + (30 − 25)^2]^(1/2)

We can see how a feature with a much larger range completely overshadows the smaller one. This affects the performance of all distance-based models, because it implicitly gives higher weight to variables with larger magnitude.
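The point can be sketched numerically, using hypothetical values matching the formula above (feature 1 on the order of tens of thousands, feature 2 on the order of tens):

```python
import numpy as np

a = np.array([100000.0, 30.0])
b = np.array([80000.0, 25.0])

d = np.sqrt(((a - b) ** 2).sum())
print(d)  # ~20000.0006: the second feature barely moves the distance at all
```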

Train test split and cross-validation methods

*Train Test Split*: test the model on data points that were not used for training, rather than on the same points the model was fit to. This captures model performance much better.

*Cross Validation*: when the data is split into train and test sets only once, a particular type of data point may end up entirely in the training or the test portion, leading to a poorly performing model. Cross-validation techniques help avoid this, mitigating both overfitting and underfitting.

*About Stratify*: the stratify parameter makes the split so that the proportion of values in the produced samples matches the proportion of values in the array passed to stratify.

For example, if variable y is a binary categorical variable with values 0 and 1 and there are 25% zeros and 75% ones, stratify=y makes sure that your random split has 25% zeros and 75% ones.
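A minimal sketch of that stratify guarantee, using hypothetical labels with the 25%/75% split described above (test_size and random_state here are arbitrary illustration values):

```python
import numpy as np
from sklearn.model_selection import train_test_split

y_demo = np.array([1] * 25 + [0] * 75)  # 25% ones, 75% zeros
X_demo = np.zeros((100, 1))             # dummy features

_, _, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                    random_state=0, stratify=y_demo)
print(y_tr.mean(), y_te.mean())  # both ~0.25, matching the full data
```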

In [ ]:
#importing train_test_split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=1/3,random_state=42, stratify=y)
In [ ]:
from sklearn.neighbors import KNeighborsClassifier


test_scores = []
train_scores = []

for i in range(1,15):

    knn = KNeighborsClassifier(i)
    knn.fit(X_train,y_train)
    
    train_scores.append(knn.score(X_train,y_train))
    test_scores.append(knn.score(X_test,y_test))
In [ ]:
## score that comes from testing on the same datapoints that were used for training
max_train_score = max(train_scores)
train_scores_ind = [i for i, v in enumerate(train_scores) if v == max_train_score]
print('Max train score {} % and k = {}'.format(max_train_score*100,list(map(lambda x: x+1, train_scores_ind))))
Max train score 100.0 % and k = [1]
In [ ]:
## score that comes from testing on the datapoints that were split in the beginning to be used for testing solely
max_test_score = max(test_scores)
test_scores_ind = [i for i, v in enumerate(test_scores) if v == max_test_score]
print('Max test score {} % and k = {}'.format(max_test_score*100,list(map(lambda x: x+1, test_scores_ind))))
Max test score 76.5625 % and k = [11]

Result Visualisation

In [ ]:
plt.figure(figsize=(12,5))
p = sns.lineplot(x=range(1,15), y=train_scores, marker='*', label='Train Score')
p = sns.lineplot(x=range(1,15), y=test_scores, marker='o', label='Test Score')

The best result is captured at k = 11, hence k = 11 is used for the final model

In [ ]:
#Setup a knn classifier with k neighbors
knn = KNeighborsClassifier(11)

knn.fit(X_train,y_train)
knn.score(X_test,y_test)
Out[ ]:
0.765625
In [ ]:
value = 20000
width = 20000
plot_decision_regions(X.values, y.values, clf=knn, legend=2, 
                      filler_feature_values={2: value, 3: value, 4: value, 5: value, 6: value, 7: value},
                      filler_feature_ranges={2: width, 3: width, 4: width, 5: width, 6: width, 7: width},
                      X_highlight=X_test.values)

# Adding axes annotations
#plt.xlabel('sepal length [cm]')
#plt.ylabel('petal length [cm]')
plt.title('KNN with Diabetes Data')
plt.show()

Analyzing the model's performance

1. Confusion Matrix

The confusion matrix is a technique for summarizing the performance of a classification algorithm, i.e. one that produces discrete (here, binary) outputs.

In [ ]:
#import confusion_matrix
from sklearn.metrics import confusion_matrix
#let us get the predictions using the classifier we had fit above
y_pred = knn.predict(X_test)
confusion_matrix(y_test,y_pred)
pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
Out[ ]:
Predicted 0 1 All
True
0 142 25 167
1 35 54 89
All 177 79 256
In [ ]:
y_pred = knn.predict(X_test)
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
p = sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Out[ ]:
Text(0.5, 20.049999999999997, 'Predicted label')

2. Classification Report

A report which includes Precision, Recall and F1-Score.

Precision Score

    TP – True Positives
    FP – False Positives

    Precision – Accuracy of positive predictions.
    Precision = TP/(TP + FP)


Recall Score

    FN – False Negatives

    Recall(sensitivity or true positive rate): Fraction of positives that were correctly identified.
    Recall = TP/(TP+FN)

F1 Score

    F1 Score (aka F-Score or F-Measure) – A helpful metric for comparing two classifiers.
    F1 Score takes into account both precision and recall.
    It is created by finding the harmonic mean of precision and recall.

    F1 = 2 x (precision x recall)/(precision + recall)



*Precision* - Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question this metric answers is: of all the patients labeled as diabetic, how many actually are diabetic? High precision relates to a low false positive rate.

Precision = TP/(TP + FP)

*Recall (Sensitivity)* - Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class. The question recall answers is: of all the patients who truly have diabetes, how many did we label as such? A recall above 0.5 is generally considered good.

Recall = TP/(TP + FN)

*F1 score* - F1 Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but F1 is usually more useful than accuracy, especially if you have an uneven class distribution. Accuracy works best if false positives and false negatives have similar cost; if their costs are very different, it is better to look at both Precision and Recall.

F1 Score = 2 × (Recall × Precision) / (Recall + Precision)
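Plugging the class-1 counts from the confusion matrix above (TP = 54, FP = 25, FN = 35) into these formulas reproduces the per-class values that classification_report prints:

```python
TP, FP, FN = 54, 25, 35  # counts taken from the confusion matrix above

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.68 0.61 0.64
```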

For reference:

  • http://joshlawman.com/metrics-classification-report-breakdown-precision-recall-f1/
  • https://blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/

In [ ]:
#import classification_report
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))
              precision    recall  f1-score   support

           0       0.80      0.85      0.83       167
           1       0.68      0.61      0.64        89

    accuracy                           0.77       256
   macro avg       0.74      0.73      0.73       256
weighted avg       0.76      0.77      0.76       256

3. ROC - AUC

The ROC (Receiver Operating Characteristic) curve tells us how well the model can distinguish between two classes (e.g. whether a patient has a disease or not). A better model can accurately distinguish between the two, whereas a poor model has difficulty separating them.
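A tiny sketch of the two extremes, using sklearn's roc_auc_score on hypothetical labels and scores: a perfectly separating scorer gets AUC 1.0, and a scorer with no separation gets 0.5 (the diagonal in the plot below):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
perfect = np.array([0.1, 0.2, 0.8, 0.9])  # scores rank every positive above every negative
useless = np.array([0.5, 0.5, 0.5, 0.5])  # no separation at all

print(roc_auc_score(y_true, perfect))  # 1.0
print(roc_auc_score(y_true, useless))  # 0.5
```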

In [ ]:
from sklearn.metrics import roc_curve
y_pred_proba = knn.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
In [ ]:
plt.plot([0,1],[0,1],'k--')
plt.plot(fpr,tpr, label='Knn')
plt.xlabel('fpr')
plt.ylabel('tpr')
plt.title('Knn(n_neighbors=11) ROC curve')
plt.show()
In [ ]:
#Area under ROC curve
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test,y_pred_proba)
Out[ ]:
0.8193500639171096

Hyperparameter optimization

Grid search is an approach to hyperparameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid.

Let’s consider the following example:

Suppose, a machine learning model X takes hyperparameters a1, a2 and a3. In grid searching, you first define the range of values for each of the hyperparameters a1, a2 and a3. You can think of this as an array of values for each of the hyperparameters. Now the grid search technique will construct many versions of X with all the possible combinations of hyperparameter (a1, a2 and a3) values that you defined in the first place. This range of hyperparameter values is referred to as the grid.

Suppose you defined the grid as: a1 = [0,1,2,3,4,5], a2 = [10,20,30,40,50,60], a3 = [105,110,115,120,125]

Note that the array of values you define for a hyperparameter has to be legitimate, in the sense that you cannot supply floating-point values if the hyperparameter only takes integer values.

Now, grid search will begin its process of constructing several versions of X with the grid that you just defined.

It will start with the combination [0, 10, 105] and end with [5, 60, 125], going through every intermediate combination along the way, which is what makes grid search computationally very expensive.
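Enumerating that grid with itertools.product shows the combinatorial cost directly (using the example values from the text, with the apparent typos in a2 and a3 fixed):

```python
from itertools import product

a1 = [0, 1, 2, 3, 4, 5]
a2 = [10, 20, 30, 40, 50, 60]
a3 = [105, 110, 115, 120, 125]

combos = list(product(a1, a2, a3))  # every (a1, a2, a3) combination
print(len(combos))    # 180 candidate models: 6 * 6 * 5
print(combos[0])      # (0, 10, 105)
print(combos[-1])     # (5, 60, 125)
```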

In [ ]:
#import GridSearchCV
from sklearn.model_selection import GridSearchCV
#In case of classifier like knn the parameter to be tuned is n_neighbors
param_grid = {'n_neighbors':np.arange(1,50)}
knn = KNeighborsClassifier()
knn_cv= GridSearchCV(knn,param_grid,cv=5)
knn_cv.fit(X,y)

print("Best Score:" + str(knn_cv.best_score_))
print("Best Parameters: " + str(knn_cv.best_params_))
Best Score:0.7721840251252015
Best Parameters: {'n_neighbors': 25}
In [ ]: